Abstract — In this paper, we consider the problem of multi-armed
bandits with a large, possibly infinite number of correlated arms.
We assume that the arms have Bernoulli distributed
rewards, independent across time, where the probabilities 
of success are parametrized by known attribute vectors for 
each arm, as well as an unknown preference vector, each of 
dimension n. For this model, we seek an algorithm with a total 
regret that is sub-linear in time and independent of the number 
of arms. We present such an algorithm, which we call the Two- 
Phase Algorithm, and analyze its performance. We show upper 
bounds on the total regret which apply uniformly in time, for
both the finite and infinite arm cases. The asymptotics of the
finite arm bound show that for any $f \in \omega(\log(T))$, the total
regret can be made to be $O(n \cdot f(T))$. In the infinite arm case,
the total regret is $O(\sqrt{n^3 T})$.

I. INTRODUCTION 

A. Motivation 

The stochastic multi-armed bandit problem is the follow- 
ing: suppose we are allowed to choose to "pull," or play, 
any one of m slot machines (also known as one-armed 
bandits) in each of T timesteps, where each slot machine 
generates a reward according to its own distribution which 
is unknown to us. The parameters of the reward distributions 
are correlated between machines, but the rewards themselves 
are independent across machines, and independent and iden- 
tically distributed across timesteps. The choice of which arm 
to pull may be a function of the sequence of past pulls and the 
sequence of past rewards. If our goal is to maximize the total 
reward obtained, taking expectation over the randomness 
of the outcomes, ideally we would pull the arm with the 
largest mean at every timestep. However, we do not know 
in advance which arm has the largest mean, so a certain 
amount of exploration is required. Too much exploration, 
though, wastes time that could be spent reaping the reward 
offered by the best arm. This exemplifies the fundamental 
trade-off between exploration and exploitation present in a 
wide class of online machine learning problems. 

We consider a model for multi-armed bandit problems 
in which a large number of arms are present, where the 
expected rewards of the arms are coupled through an un- 
known parameter of lower dimension. Now, it is no longer 
necessary for each arm to be investigated in order to estimate 
the expected reward from that arm. Instead, we can estimate 
the underlying parameter; in this way, each pull can yield
information about multiple arms. We present a simple algorithm,
as well as bounds on the expected total regret as a function of
time horizon when using this algorithm. While possibly sub-optimal,
these bounds are independent of the number of arms.

Research supported in part by AFOSR MURI FA 9550-10-1-0573.

This model is applicable to certain e-commerce appli- 
cations: suppose an online retailer has a large number of 
related products, and wishes to maximize revenue or profit 
coming from a certain set of customers. If the preferences 
of this set of customers are known, the list of items which 
are displayed can be sorted in descending order of expected 
revenue or profit. However, we may not know a priori what 
this preference vector is, so we wish to learn online by 
sequentially presenting each user with an item, observing 
whether the user buys the item, and then updating an internal 
estimate of the preference vector. 

As a concrete example, imagine an online camera store, 
with hundreds of different camera models in stock. However, 
there are perhaps closer to ten features which people will 
compare when deciding which, if any, to purchase. There are 
permanent features of the camera itself, such as megapixel 
count, brand name, and year of introduction, as well as 
extrinsic features, such as price, review scores, and item 
popularity. All of these features might be considered by the 
customer in order to decide whether or not to buy the camera. 
If bought, the store gains a profit corresponding to the item. 
A key distinction of our model, when compared to previous 
work, is the incorporation of this inherently binary choice 
customers are faced with: to buy or not to buy. 

B. Model 

Our model consists of a multi-armed bandit with a set $U$
consisting of $m$ arms (items) and $n$ underlying parameters
(attributes), where $m \ge n$ and potentially $m \gg n$. We will
interchangeably also think of $U$ as being an $n \times m$ matrix,
where each arm $u$ is an $n$-dimensional attribute vector, and
is one of the columns of $U$. Furthermore, we will assume
that $\mathrm{rank}(U) = n$. There is also a constant but unknown
preference vector $z^* \in \mathbb{R}^n$. The quality $\beta_u = u^T z^*$ of arm
$u$ is a scalar indicating how desirable the item is to a user.
We will use the logistic function $f$ to define the expected
reward of an arm $u$, assuming a particular $z$, as

$$\alpha_u(z) = f(u^T z) = \frac{1}{1 + \exp(-u^T z)}.$$



Thus, the expected rewards of all of the arms are coupled
through $z^*$. For notational simplicity, we define
$\alpha^*_u = \alpha_u(z^*)$. Let the set of equally best arms be

$$V = \left\{ v \in U : \alpha^*_v = \max_{u \in U} \alpha^*_u \right\} \subseteq U.$$

Define the expected reward of a best arm to be

$$\alpha^*_V = \max_{u \in U} \alpha^*_u.$$

At each timestep $t$ up to a finite time horizon $T$, a policy
will choose to pull exactly one arm, call this arm $C_t$, and a
reward $X_t$ will be obtained, where $X_t \sim \mathrm{Ber}(\alpha^*_{C_t})$. We wish
to find policies $g$ which maximize the total expected reward,
$E_g\left[\sum_{t=1}^{T} X_t\right]$, or equivalently, minimize the expected total regret,

$$E_g\left[\sum_{t=1}^{T} (\alpha^*_V - X_t)\right] = T \cdot \alpha^*_V - E_g\left[\sum_{t=1}^{T} X_t\right].$$

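The model above can be illustrated with a minimal Python sketch. The arm and preference vectors below are hypothetical values chosen only for illustration, and the helper names (`logistic`, `expected_reward`, `per_pull_regret`) are not from the paper.

```python
import math

def logistic(beta):
    """Logistic function f(beta) = 1 / (1 + exp(-beta))."""
    return 1.0 / (1.0 + math.exp(-beta))

def expected_reward(u, z):
    """alpha_u(z) = f(u^T z) for attribute vector u and preference vector z."""
    return logistic(sum(ui * zi for ui, zi in zip(u, z)))

def per_pull_regret(u, arms, z_star):
    """Expected regret of one pull of arm u: alpha*_V - alpha*_u."""
    best = max(expected_reward(v, z_star) for v in arms)
    return best - expected_reward(u, z_star)

# Hypothetical example (not from the paper): n = 2 attributes, m = 3 arms.
arms = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
z_star = (0.5, -0.25)
regrets = [per_pull_regret(u, arms, z_star) for u in arms]
```

Here the first arm has the largest quality $u^T z^*$, so its per-pull regret is zero and the other two arms have strictly positive regret.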
C. Prior Work 

For an introduction and survey of classical multi-armed 
bandit problems and their variations, see Mahajan and 
Teneketzis [1]. One of the earliest breakthroughs on the 
classical multi-armed bandit problem came from Gittins and 
Jones [2], who showed that under geometric discounting, the 
optimal policy assigns an index to each arm, now known 
as the Gittins index, and pulls the arm with the largest 
Gittins index. Other proofs of this optimality have been 
given later by Weber [3] and Tsitsiklis [4]. Whittle [5] 
proved that a similar index-based result is nearly optimal 
in the "restless bandit" variation of this model, where the 
arms which are not pulled also evolve in time. While these
policies greatly simplify a single $m$-dimensional problem
into $m$ one-dimensional problems, they are still, in general, too
computationally complex for online learning.

Lai and Robbins [6] proved an achievable $O(m \cdot \log T)$
lower bound for the expected total regret of the stochastic
multi-armed bandit problem in the case of independent
arms. Related work by Agrawal et al. [7], [8], [9], [10]
and Anantharam et al. [11], [12] considered similar models
with i.i.d. and Markov time dependencies for each arm,
constructed index policies which are computationally much
simpler, and extended the results to include "multiple plays"
and "switching costs".

Abe et al. [13] and Auer [14] considered models with finite
numbers of arms, with reward distributions that are correlated 
through a multi-variate parameter z of dimension n, and 
obtained upper bounds on the regret of order O(vmf) 
and $O(\sqrt{nT} \cdot \log T)$, respectively. Mersereau et al. [15]
considered a model in which the expected rewards are affine
functions of a scalar parameter $z$, but allowed the set of
arms to be a bounded, convex region in $\mathbb{R}^n$, in which
case $m$ is uncountably infinite. They then derived a policy
whose expected total regret is $O(\sqrt{T})$. Rusmevichientong
and Tsitsiklis [16] expanded this model to allow for a multi-variate
parameter $z$ of dimension $n$, and showed that the
expected total regret (ignoring $\log T$ factors) is $O(n\sqrt{T})$.
Dani et al. [17] independently considered a nearly identical
model, and obtained similar results. Kleinberg et al. [18]
considered a model in which the deterministic rewards are a
Lipschitz-continuous function of the $n$-dimensional vector
corresponding to each arm, and obtained an expected total
regret (ignoring $\log T$ factors) of $O(T^{\frac{n+1}{n+2}})$.

Auer et al. [19] considered a non-stochastic version of the
multi-armed bandit problem, in which the rewards are no 
longer drawn from an unknown distribution, but can instead 
be adversarially generated. The resultant total weak regret, 
calculated by comparison with the single arm which is best 
over the entire time horizon, is shown to be $O(\sqrt{mT})$. The
change from logarithmic to polynomial regret in this model 
is due to having rewards which are time-dependent and 
potentially adversarially generated, instead of being drawn 
from a time-independent distribution. 

Audibert et al. [20] considered the problem of best arm 
identification in a stochastic multi-armed bandit setting, but 
where the goal is to maximize the probability of determining 
the best arm at the end of a time horizon, as opposed 
to the usual goal of minimizing total regret over a time 
horizon. This model is useful when considering exploration 
and exploitation as occurring in series, instead of in parallel. 
The probability of error is shown to be upper bounded by a 
decaying exponential in T. 

Auer et al. [21] investigated the finite-time regret of the 
multi-armed bandit problem, assuming bounded but other- 
wise arbitrary reward distributions. Using upper confidence 
bound (UCB) algorithms, where the confidence interval of 
an arm shrinks as the arm is subjected to more plays, they 
achieve a logarithmic upper bound on the regret, uniform 
over time, that scales with the "gaps" between the expected 
rewards for the arms. One algorithm they propose, UCB2, 
selects the arm with largest empirical mean plus confidence 
interval, plays it for a number of timesteps dependent on 
how often that particular arm has been selected in the past, 
and repeats this process until the time-horizon is reached. 
This achieves asymptotically optimal expected total regret, 
and has the best constant possible. 

A common idea used in crafting policies to solve the 
multi-armed bandit problem is that of the doubling trick 
[22], [23]. This technique is used to convert a parametrized 
algorithm which works on a time horizon T, along with its 
corresponding bound, into a non-parametrized algorithm that 
runs forever, with an upper bound that holds uniformly over 
time. 

II. TWO-PHASE ALGORITHM 

We first present an algorithmic description of a policy for
the multi-armed bandit problem described in Section I.B.
This algorithm, which we call the Two-Phase Algorithm, will
depend on a scheduling function $g : \mathbb{N}_1 \to \mathbb{N}_0$, such that $g$
is strictly increasing. Since $g$ is not surjective in general,
its inverse $g^{-1}$ is not defined over all of $\mathbb{N}_0$; however, we
can extend the inverse image in the natural way to preserve
monotonicity, by defining $g^{-1} : \mathbb{N}_0 \to \mathbb{N}_1$,

$$g^{-1}(t) = \max\left\{ 1 \cup \{ l \in \mathbb{N}_1 : g(l) \le t \} \right\}.$$

In Theorem 3.5 we will show an upper-bound to the expected
total regret of this policy on finite arms, which is
independent of the number of arms $m$. In Theorem 4.6 we
will show an upper-bound to the expected total regret of this
policy on a special case when there are uncountably infinite
arms.
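As a small illustration of the extended inverse defined above, the following Python sketch computes $g^{-1}(t)$ by a linear scan. It assumes `g` is a strictly increasing integer-valued callable; the quadratic schedule used in the usage note is only a hypothetical example.

```python
def g_inverse(g, t):
    """Extended inverse: max({1} | {l in N_1 : g(l) <= t}).

    Assumes g is strictly increasing on the positive integers, so the
    scan below terminates.
    """
    l = 1
    while g(l + 1) <= t:
        l += 1
    return l
```

For example, with the hypothetical schedule `g = lambda l: l * l`, `g_inverse(g, 10)` returns 3, `g_inverse(g, 0)` returns 1, and `g_inverse(g, g(5))` returns 5.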

Algorithm 1 Two-Phase Algorithm
Require: Set of all arms $U$
Require: Set of $n$ chosen arms $\Xi = \{\xi_1, \ldots, \xi_n\} \subseteq U$, s.t. $\Xi$ has rank $n$
Require: Scheduling function $g : \mathbb{N}_1 \to \mathbb{N}_0$, strictly increasing
  $t \leftarrow 1$, $l \leftarrow 1$
  $q_u \leftarrow 0, \forall u \in \Xi$
  loop
    for $u \in \Xi$ do
      Pull arm $C_t \leftarrow u$, obtain reward $X_t$ {Phase 1}
      $q_{C_t} \leftarrow q_{C_t} + X_t$
      $t \leftarrow t + 1$
    end for
    Form the estimates $\hat\alpha_{u,l} \leftarrow q_u / l, \forall u \in \Xi$
    if $\hat\alpha_{u,l} \in (0, 1), \forall u \in \Xi$ then
      $\hat z_l \leftarrow (\Xi^T)^{-1} f^{-1}(\hat\alpha_{\Xi,l})$
    else
      $\hat z_l \leftarrow 0_n$
    end if
    $C(l) \leftarrow \arg\max_{u \in U} \alpha_u(\hat z_l)$, settling ties arbitrarily
    for $s \leftarrow 1$ to $g(l)$ do
      Pull arm $C_t \leftarrow C(l)$, obtain reward $X_t$ {Phase 2}
      $t \leftarrow t + 1$
    end for
    $l \leftarrow l + 1$
  end loop

The algorithm requires a selection of $n$ arms,

$$\Xi = \{\xi_1, \ldots, \xi_n\} \subseteq U, \text{ s.t. } \Xi \text{ has rank } n.$$

Such a choice exists since we assume $U$ has rank $n$.
The algorithm proceeds in epochs; epoch $l$ consists of $n$
exploration pulls (called Phase 1), one for each arm in $\Xi$,
and $g(l)$ exploitation pulls (called Phase 2). In other words,
Phase 1 refines our estimate of $z^*$, and Phase 2 repeatedly
pulls the best arm given our current estimate $\hat z_l$. If we impose
a time horizon of $T$, epochs $1, 2, \ldots, L$ are appended until
the time horizon $T$ has been reached. The two phases are
illustrated in Figure 1.

For each timestep $t$ in Phase 1, an arm $u \in \Xi$ is chosen,
and the empirical count of successes $q_u$ is incremented if
$X_t = 1$. Prior to each Phase 2 timestep during epoch $l$,
there have already been $l$ Phase 1 pulls of each arm in $\Xi$. We can then form
empirical estimates for $\alpha^*_u$ based on the Phase 1 timesteps,
namely $\hat\alpha_{u,l} = \frac{1}{l} q_u$, $\forall u \in \Xi$. If

$$\hat\alpha_{u,l} \in (0, 1), \forall u \in \Xi,$$

then we call epoch $l$ a good epoch, and form the current best
estimate for $z^*$,

$$\hat z_l = (\Xi^T)^{-1} f^{-1}(\hat\alpha_{\Xi,l}),$$

since $f$ being strictly increasing and continuous implies $f^{-1}$
exists on $(0, 1)$, and since $\Xi$ being an $n \times n$ matrix with full
rank implies $(\Xi^T)^{-1}$ exists. Otherwise, we call epoch $l$ a bad
epoch, and let $\hat z_l = 0_n$. Define the event $G_l$ to mean that
epoch $l$ is a good epoch. Note that $G_l \implies G_{l+i}$ $\forall i \in \mathbb{N}_1$.
Then, choose an arm

$$C(l) = \arg\max_{u \in U} \alpha_u(\hat z_l),$$

settling ties arbitrarily, and pull this arm $g(l)$ times to form
the current epoch's Phase 2.

Remark 2.1: In practice, LU decomposition, instead of
matrix inversion, can be used to solve for $\hat z_l$. Also, since
$f$ is strictly increasing, the estimated best arm in a good
epoch $l$ can be computed as

$$C(l) = \arg\max_{u \in U} \{ u^T \hat z_l \}.$$
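The epoch structure described in this section can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: rewards are simulated from a known `z_star` (which a real policy would not see), the linear system is solved by a hand-written Gaussian elimination standing in for the LU decomposition of Remark 2.1, ties are broken by list order, and all function names are hypothetical.

```python
import math
import random

def logit(a):
    """Inverse of the logistic function on (0, 1)."""
    return math.log(a / (1.0 - a))

def dot(u, z):
    return sum(ui * zi for ui, zi in zip(u, z))

def solve(rows, b):
    """Solve rows @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(r) + [bi] for r, bi in zip(rows, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            k = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= k * m[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def two_phase(arms, basis, z_star, g, horizon, rng):
    """Simulate `horizon` pulls; return (list of pulled arms, total reward)."""
    q = [0] * len(basis)            # success counts of the exploration arms
    pulled, total, l = [], 0, 1

    def pull(u):
        nonlocal total
        x = 1 if rng.random() < 1.0 / (1.0 + math.exp(-dot(u, z_star))) else 0
        pulled.append(u)
        total += x
        return x

    while True:
        for i, u in enumerate(basis):           # Phase 1: explore
            if len(pulled) == horizon:
                return pulled, total
            q[i] += pull(u)
        a_hat = [qi / l for qi in q]
        if all(0.0 < a < 1.0 for a in a_hat):   # good epoch
            z_hat = solve(basis, [logit(a) for a in a_hat])
        else:                                   # bad epoch
            z_hat = [0.0] * len(basis)
        best = max(arms, key=lambda u: dot(u, z_hat))
        for _ in range(g(l)):                   # Phase 2: exploit
            if len(pulled) == horizon:
                return pulled, total
            pull(best)
        l += 1
```

A call such as `two_phase(arms, arms[:2], (2.0, -1.0), lambda l: l * l, 200, random.Random(0))` runs 200 pulls with a hypothetical quadratic schedule; the exploration arms passed as `basis` are assumed to be linearly independent.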

We shall point out some of the ideas behind this algorithm. 
First, the algorithm is defined to run indefinitely; to obtain 
the total regret for any finite time horizon T, we simply 
terminate the algorithm when timestep T has been reached. 
This achieves the same outcome as an application of the 
doubling trick, in that the algorithm is not dependent on a 
time horizon T. Our algorithm is similar to the algorithm 
UCB2 of [21]. The main difference is that in our exploration 
phases, the choice of arm exploits the correlation model 
that we have assumed in our problem. Furthermore, as we 
will see later, unlike UCB2, the lengths of the exploitation 
phases are chosen to grow sub-exponentially in the epoch 
number (e.g., $g(l) \in \exp(o(l))$ for finite arms) in order
to obtain a regret bound that grows slightly faster than
logarithmically in the time horizon (e.g., $E[R_T] \in \omega(\log(T))$
for finite arms). As we gain more information and are able to 
estimate z* more accurately, we can spend a greater fraction 
of timesteps exploiting the arm we think is best; this is 
achieved by choosing a suitable scheduling function g to 
control the ratio of the number of exploitation (Phase 2) 
pulls versus exploration (Phase 1) pulls, as a function of the 
epoch number $l$.

Note that there is only randomness in the outcomes
$\{X_t\}_{t=1}^{T}$, since the Two-Phase Algorithm is deterministic in
the selection of the arm $C_t$, conditioned on the history. We
will use $\omega$ to denote the sample-paths of $\{X_t\}_{t=1}^{T}$. Let $L$
denote the number of epochs (including partial epochs, as
the final one may be truncated) up to timestep $T$, which is
independent of sample-path $\omega$. While $L$ is actually a function
of $T$, we will not write this dependence explicitly.

Define the expected regret in a single Phase 2 timestep in
epoch $l$ to be $E[r_{2,l}]$. Note that this value is the same for
every Phase 2 timestep in epoch $l$, and hence is independent
of timestep. Define the total regret up to timestep $T$ in
the Phase $i$ timesteps for a sample-path $\omega$ to be $R_{i,T}(\omega)$.
Define the total regret up to timestep $T$ for a sample-path $\omega$
to be $R_T(\omega) = R_{1,T}(\omega) + R_{2,T}(\omega)$. Our goal is to find
an upper-bound on $E[R_T]$, the expected total regret. In
particular, we are interested in the asymptotic behavior of
the upper-bound as $T \to \infty$.

[Figure 1: timeline of the $T$ timesteps. Epochs $1, 2, 3, \ldots, L$ each consist of $n$ Phase 1 timesteps followed by $g(1), g(2), g(3), \ldots, g(L)$ Phase 2 timesteps, respectively.]

Fig. 1. Given a time horizon T, we partition the T timesteps into Phase 1 and Phase 2 timesteps, grouped into a total of L epochs.



III. ANALYSIS, FINITE ARMS 

Consider the multi-armed bandit problem described in
Section I.B, where the $m$ arms can each take on any value
in $\mathbb{R}^n$, as long as $U$ is full rank.

A. Upper-bound Results 

Lemma 3.1: For the Two-Phase Algorithm, we have the
following bound on the expected total Phase 1 regret up to
timestep $T$:

$$E[R_{1,T}] \le \alpha^*_V n L.$$

Proof:

$$E[R_{1,T}] \le E\left[ \sum_{l=1}^{L} \sum_{u \in \Xi} (\alpha^*_V - \alpha^*_u) \right] \le \alpha^*_V n L.$$

□

[Figure 2: the arms $u_1$, $u_2$, $u_3$, the preference vector $z^*$, and the shaded region $A$.]

Fig. 2. As an example, consider a scenario with $n = 2$ and $m = 3$. The
arms $U = \{u_1, u_2, u_3\}$ and the preference vector $z^*$ are located at the
indicated points. The shaded region is $A$; the boundary of $A$ is formed by
the perpendicular bisectors of the segments $u_3, u_1$ and $u_1, u_2$.



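Lemma 3.2: The probability that epoch $l$ is a bad epoch
is upper-bounded by

$$2n \cdot \exp\left\{ -2l \cdot f\left( -\max_{u \in \Xi} \|u\| \cdot \|z^*\| \right)^2 \right\}.$$

Proof: In order for epoch $l$ to be a good epoch, we have the
condition that $\hat\alpha_{u,l} \in (0, 1)$ $\forall u \in \Xi$. Consider the condition
for a bad epoch:

$$\exists u \in \Xi \text{ s.t. } \hat\alpha_{u,l} \notin (0, 1)$$
$$\implies \exists u \in \Xi \text{ s.t. } \left| \hat\alpha_{u,l} - \tfrac{1}{2} \right| \ge \tfrac{1}{2}$$
$$\implies \exists u \in \Xi \text{ s.t. } \left| \hat\alpha_{u,l} - \alpha^*_u \right| \ge \tfrac{1}{2} - \left| \alpha^*_u - \tfrac{1}{2} \right|$$
$$\implies \exists u \in \Xi \text{ s.t. } \left| \hat\alpha_{u,l} - \alpha^*_u \right| \ge \tfrac{1}{2} - \max_{u \in \Xi} \left| \alpha^*_u - \tfrac{1}{2} \right|.$$

Note that

$$\max_{u \in \Xi} \left| \alpha^*_u - \tfrac{1}{2} \right| \le f\left( \max_{u \in \Xi} \|u\| \cdot \|z^*\| \right) - \tfrac{1}{2}.$$

Then, applying the union bound and Chernoff bound, it
follows that

$$P(\exists u \in \Xi \text{ s.t. } \hat\alpha_{u,l} \notin (0, 1))
\le 2n \cdot \exp\left\{ -2l \left[ 1 - f\left( \max_{u \in \Xi} \|u\| \cdot \|z^*\| \right) \right]^2 \right\}
= 2n \cdot \exp\left\{ -2l \cdot f\left( -\max_{u \in \Xi} \|u\| \cdot \|z^*\| \right)^2 \right\}.$$

Let $k_1 = 2 f\left( -\max_{u \in \Xi} \|u\| \cdot \|z^*\| \right)^2$. Thus, the probability that
epoch $l$ is a bad epoch is then upper-bounded by

$$2n \cdot \exp(-k_1 l).$$

□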
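As a numerical sanity check of the bad-epoch bound, the sketch below compares it with a union bound built from the exact per-arm probability that all $l$ exploration pulls of an arm agree (all successes or all failures). It assumes the Hoeffding form of the constant, $k_1 = 2 f(-\max_{u}\|u\| \cdot \|z^*\|)^2$; the function names and the example vectors are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def bad_epoch_exact_union(basis, z_star, l):
    """Sum over exploration arms of P(all l pulls succeed or all l pulls fail)."""
    total = 0.0
    for u in basis:
        a = logistic(sum(ui * zi for ui, zi in zip(u, z_star)))
        total += a ** l + (1.0 - a) ** l
    return total

def bad_epoch_bound(basis, z_star, l):
    """2n * exp(-k1 * l), with k1 = 2 * f(-max||u|| * ||z*||)^2 (assumed form)."""
    k1 = 2.0 * logistic(-max(norm(u) for u in basis) * norm(z_star)) ** 2
    return 2 * len(basis) * math.exp(-k1 * l)
```

For the hypothetical choice `basis = [(1.0, 0.0), (0.0, 1.0)]` and `z_star = (0.5, -0.25)`, the exact union bound stays below the exponential bound for every epoch number tried, and both decay geometrically in $l$.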



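Lemma 3.3: For the Two-Phase Algorithm on finite arms,
for a given choice of scheduling function

$$g \text{ s.t. } g(l) \in \exp(o(l)),$$

we have the following bound on the expected Phase 2 regret
per timestep in a good epoch $l$:

$$E[r_{2,l} \mid G_l] \le 2 \alpha^*_V n \cdot \exp(-\gamma l),$$

where $\gamma$ is a constant which depends on $U$ and $z^*$.
Proof: Recall that $\alpha^*_u = f(u^T z^*)$, where

$$f(\beta) = \frac{1}{1 + \exp(-\beta)}$$

is strictly increasing and continuous. Thus $f^{-1}$ is well
defined, strictly increasing and continuous. Recall that

$$V = \left\{ v \in U : \alpha^*_v = \max_{u \in U} \alpha^*_u \right\}$$

is the set of equally best arms. Because $f(u^T z)$ is continuous
in $z$ and defined over $\mathbb{R}^n$, it follows that there exists a
neighborhood of $z^*$, denoted $A$, such that

$$A = \left\{ z \in \mathbb{R}^n : \arg\max_{u \in U} \alpha_u(z) \in V \right\}.$$

Since $\Xi$ is full rank, $A$ must contain an open parallelotope
centered at $z^*$,

$$B_{z^*}(\delta) = \left\{ z \in \mathbb{R}^n : \left\| \Xi^T z - \Xi^T z^* \right\|_\infty < \delta \right\},$$

where $\delta > 0$ and is largest possible. An example of the
problem parameters and the induced region $A$ is shown in
Figure 2.

Consider any $z \in B_{z^*}(\delta)$. By definition,

$$|u^T z - u^T z^*| < \delta, \forall u \in \Xi.$$

This is equivalent to

$$\left| f^{-1}(\alpha_u(z)) - f^{-1}(\alpha^*_u) \right| < \delta, \forall u \in \Xi.$$

Since $f^{-1}$ is continuous, this is equivalent to having a set of
constants

$$\underline{\alpha}_u < \alpha_u(z) < \overline{\alpha}_u, \forall u \in \Xi,$$

where

$$\underline{\alpha}_u = f\left( f^{-1}(\alpha^*_u) - \delta \right) \text{ and } \overline{\alpha}_u = f\left( f^{-1}(\alpha^*_u) + \delta \right), \forall u \in \Xi.$$

For a Phase 2 timestep during a good epoch $l$, the algorithm
forms the empirical average rewards

$$\hat\alpha_{u,l} \in (0, 1), \forall u \in \Xi.$$

By the discussion above,

$$\underline{\alpha}_u < \hat\alpha_{u,l} < \overline{\alpha}_u, \forall u \in \Xi$$
$$\implies \hat z_l \in B_{z^*}(\delta) \subseteq A$$
$$\implies C_t = \arg\max_{u \in U} \{ u^T \hat z_l \} \in V$$

and we will have chosen one of the best arms, accumulating
zero regret.

Note that during epoch $l$, $\hat\alpha_{u,l}$ is the average of $l$ i.i.d. $\mathrm{Ber}(\alpha^*_u)$
random variables, $\forall u \in \Xi$. By the Chernoff bound,

$$P(\hat\alpha_{u,l} \le \underline{\alpha}_u \mid G_l) \le \exp\left[ -l \cdot D(\underline{\alpha}_u \| \alpha^*_u) \right], \text{ and}$$
$$P(\hat\alpha_{u,l} \ge \overline{\alpha}_u \mid G_l) \le \exp\left[ -l \cdot D(\overline{\alpha}_u \| \alpha^*_u) \right], \forall u \in \Xi,$$

where $D(p \| q) = p \cdot \log\frac{p}{q} + (1 - p) \cdot \log\frac{1 - p}{1 - q}$ is the K-L
divergence between two Bernoulli distributions.

Let $\gamma = \min_{u \in \Xi} \min\left\{ D(\underline{\alpha}_u \| \alpha^*_u), D(\overline{\alpha}_u \| \alpha^*_u) \right\}$. Note
that from the definitions of $\underline{\alpha}_u$ and $\overline{\alpha}_u$, it follows that

$$\underline{\alpha}_u < \alpha^*_u < \overline{\alpha}_u, \forall u \in \Xi.$$

Since $D(p \| q) = 0 \iff p = q$, we have that $\gamma > 0$. By
the union bound,

$$P\left( \exists u \in \Xi : \hat\alpha_{u,l} \notin (\underline{\alpha}_u, \overline{\alpha}_u) \mid G_l \right) \le 2n \cdot \exp(-\gamma l).$$

Reviewing the chain of implications, we have

$$P(\hat z_l \notin A \mid G_l)$$
$$\le P(\hat z_l \notin B_{z^*}(\delta) \mid G_l)$$
$$= P\left( \left\| \Xi^T \hat z_l - \Xi^T z^* \right\|_\infty \ge \delta \mid G_l \right)$$
$$= P\left( \exists u \in \Xi : |u^T \hat z_l - u^T z^*| \ge \delta \mid G_l \right)$$
$$= P\left( \exists u \in \Xi : \left| f^{-1}(\hat\alpha_{u,l}) - f^{-1}(\alpha^*_u) \right| \ge \delta \mid G_l \right)$$
$$= P\left( \exists u \in \Xi : \hat\alpha_{u,l} \notin (\underline{\alpha}_u, \overline{\alpha}_u) \mid G_l \right)$$
$$\le 2n \cdot \exp(-\gamma l).$$

Then, we have a bound on the expected per-timestep regret
$r_{2,l}$ during Phase 2 of epoch $l$:

$$E[r_{2,l} \mid G_l] = E[r_{2,l} \mid \hat z_l \in A, G_l] \cdot P(\hat z_l \in A \mid G_l) + E[r_{2,l} \mid \hat z_l \notin A, G_l] \cdot P(\hat z_l \notin A \mid G_l)$$
$$\le 0 \cdot P(\hat z_l \in A \mid G_l) + \alpha^*_V \cdot P(\hat z_l \notin A \mid G_l)$$
$$\le 2 \alpha^*_V n \cdot \exp(-\gamma l).$$

□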
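The constant $\gamma$ in the proof above is a minimum of Bernoulli K-L divergences, which is easy to evaluate numerically. The following sketch assumes the qualities $u^T z^*$ of the exploration arms and the half-width $\delta$ are given; the function names and example values are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_bernoulli(p, q):
    """D(p || q) between Bernoulli(p) and Bernoulli(q), taking 0 * log(0) as 0."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def gamma_constant(qualities, delta):
    """min over arms of D(lower || alpha*) and D(upper || alpha*),
    where lower = f(beta - delta), upper = f(beta + delta), alpha* = f(beta)."""
    values = []
    for beta in qualities:
        alpha_star = logistic(beta)
        values.append(kl_bernoulli(logistic(beta - delta), alpha_star))
        values.append(kl_bernoulli(logistic(beta + delta), alpha_star))
    return min(values)
```

With hypothetical qualities `[0.5, -0.25]` and `delta = 0.1`, `gamma_constant` returns a small positive number, which is consistent with the claim that $\gamma > 0$ whenever $\delta > 0$.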

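Lemma 3.4: For the Two-Phase Algorithm on finite arms,
for a given choice of scheduling function

$$g \text{ s.t. } g(l) \in \exp(o(l)),$$

we have the following bound on the expected total Phase 2
regret up to timestep $T$:

$$E[R_{2,T}] \le \alpha^*_V n (2 k_2 + L),$$

where $k_2$ is a constant which depends on $U$ and $z^*$.

Proof: By Lemmas 3.2 and 3.3 we can upper-bound the
expected total Phase 2 regret,

$$E[R_{2,T}] \le \sum_{l=1}^{L} \left\{ \left[ P(G_l) \cdot E[r_{2,l} \mid G_l] + P(\neg G_l) \cdot E[r_{2,l} \mid \neg G_l] \right] \cdot g(l) \right\}$$
$$\le \sum_{l=1}^{L} \left\{ \left[ P(G_l) \cdot 2 \alpha^*_V n \cdot \exp(-\gamma l) + P(\neg G_l) \cdot \alpha^*_V \right] \cdot g(l) \right\}$$
$$\le \sum_{l=1}^{L} \left\{ \left[ 2 \alpha^*_V n \cdot \exp(-\gamma l) + 2 \alpha^*_V n \cdot \exp(-k_1 l) \right] \cdot g(l) \right\}$$
$$\le 2 \alpha^*_V n \left( \sum_{l=1}^{L'} \left\{ \left[ \exp(-\gamma l) + \exp(-k_1 l) \right] \cdot g(l) \right\} + \sum_{l=L'+1}^{L} \frac{1}{2} \right)$$
$$\le \alpha^*_V n \left( 2 \sum_{l=1}^{L'} \left\{ \left[ \exp(-\gamma l) + \exp(-k_1 l) \right] \cdot g(l) \right\} + L \right),$$

where

$$L' = \max\left\{ l : \left[ \exp(-\gamma l) + \exp(-k_1 l) \right] \cdot g(l) > \frac{1}{2} \right\}$$

is a constant, independent of sample-path, that depends on $U$
and $z^*$ (and is therefore unknown to the algorithm). However,
since we have assumed $g(l) \in \exp(o(l))$, it follows that

$$\lim_{l \to \infty} \left[ \exp(-\gamma l) + \exp(-k_1 l) \right] \cdot g(l) = 0,$$

and thus $L'$ is finite. Let

$$k_2 = \sum_{l=1}^{L'} \left\{ \left[ \exp(-\gamma l) + \exp(-k_1 l) \right] \cdot g(l) \right\},$$

which is well defined since $L'$ is finite. Thus,

$$E[R_{2,T}] \le \alpha^*_V n (2 k_2 + L).$$

□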

Theorem 3.5: For the Two-Phase Algorithm on finite
arms, for a given choice of scheduling function

$$g \text{ s.t. } g(l) \in \exp(o(l)),$$

we have the following bound on the expected total regret up
to time-horizon $T$:

$$E[R_T] \le 2 \alpha^*_V n \left( k_2 + g^{-1}(T) + 1 \right).$$

Proof: Since the final epoch may be only partially finished,
we will lower-bound the total time with the number of
timesteps in the penultimate epoch's Phase 2,

$$T \ge \sum_{l=1}^{L-1} \{ n + g(l) \} \ge g(L - 1).$$

Equivalently,

$$L \le g^{-1}(T) + 1.$$

Then, using Lemmas 3.1 and 3.4,

$$E[R_T] = E[R_{1,T}] + E[R_{2,T}]$$
$$\le 2 \alpha^*_V n (k_2 + L)$$
$$\le 2 \alpha^*_V n \left( k_2 + g^{-1}(T) + 1 \right).$$

□
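The step $L \le g^{-1}(T) + 1$ in the proof above can be checked numerically for a concrete schedule. The sketch below counts epochs (including a final partial one) for a given horizon and compares the count against the extended inverse; the quadratic schedule in the usage note is a hypothetical example, and both helper names are illustrative.

```python
def num_epochs(g, n, horizon):
    """Number of epochs, including a final partial one, needed to cover
    `horizon` timesteps when epoch l has n + g(l) timesteps."""
    t, l = 0, 0
    while t < horizon:
        l += 1
        t += n + g(l)
    return l

def g_inverse(g, t):
    """Extended inverse: max({1} | {l in N_1 : g(l) <= t}) for increasing g."""
    l = 1
    while g(l + 1) <= t:
        l += 1
    return l
```

For instance, with `g = lambda l: l * l` and `n = 3`, `num_epochs(g, 3, T)` never exceeds `g_inverse(g, T) + 1` for the horizons tried, in line with the inequality used in the proof.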



Corollary 3.6: For the Two-Phase Algorithm on finite
arms, for a given choice of scheduling function

$$g \text{ s.t. } g(l) \in \exp(o(l)),$$

we have the following asymptotic bound on the expected
total regret up to time-horizon $T$:

$$E[R_T] \in O\left( n \cdot g^{-1}(T) \right).$$

Proof: By Theorem 3.5, as a function of $T$,

$$E[R_T] \le 2 \alpha^*_V n \left( k_2 + g^{-1}(T) + 1 \right) \in O\left( n \cdot g^{-1}(T) \right),$$

since $\alpha^*_V \le 1$, $k_2 > 0$ is a constant dependent only upon $U$
and $z^*$, and $g^{-1}(T) \in \omega(1)$.

□



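Lemma 3.7:

$$g^{-1}(t) \in \omega(\log(t)) \implies g(l) \in \exp(o(l)).$$

Proof:

$$\lim_{t \to \infty} \frac{\log(t)}{g^{-1}(t)} = \lim_{l \to \infty} \frac{\log(g(l))}{g^{-1}(g(l))} \quad (1)$$
$$= \lim_{l \to \infty} \frac{\log(g(l))}{l} \quad (2)$$
$$= 0, \quad (3)$$

where (1) is by making the substitution $t = g(l)$, recalling
that $g : \mathbb{N}_1 \to \mathbb{N}_0$ is strictly increasing by assumption, so
$\lim_{l \to \infty} g(l) = \infty$. (2) is since by construction,

$$g^{-1}(g(l)) = l, \forall l \in \mathbb{N}_1.$$

Lastly, (3) is since $g^{-1}(t) \in \omega(\log(t))$, so by definition,

$$\lim_{t \to \infty} \frac{g^{-1}(t)}{\log(t)} = \infty.$$

Hence $\log(g(l)) \in o(l)$, so $g(l) \in \exp(o(l))$, and thus $g$ is
a valid scheduling function.

□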

Let $\log^*(x)$, the iterated logarithm function, be defined
recursively by

$$\log^*(x) = \begin{cases} 0, & \text{if } x \le 1 \\ 1 + \log^*(\log x), & \text{if } x > 1 \end{cases}$$
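The recursive definition of the iterated logarithm translates directly into a short loop. The sketch below assumes natural logarithms, which is an assumption on our part since the base is not stated here; the function name `log_star` is illustrative.

```python
import math

def log_star(x):
    """Iterated logarithm: number of times log must be applied until x <= 1.

    Uses the natural logarithm (assumed base).
    """
    count = 0
    while x > 1:
        x = math.log(x)
        count += 1
    return count
```

The function grows extremely slowly: under this convention `log_star(2)` is 1, `log_star(3)` is 2, and `log_star(16)` is 3.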



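Corollary 3.8: The Two-Phase Algorithm can achieve

$$E[R_T] \in O\left( n \cdot \log(T) \cdot \log^*(T) \right).$$

Proof: Choose

$$g_{LLS}(l) = \max\left\{ t \in \mathbb{N}_1 : \log(t) \cdot \log^*(t) \le l \right\}.$$

Then,

$$g^{-1}_{LLS}(t) = \lfloor \log(t) \cdot \log^*(t) \rfloor,$$

$$\lim_{t \to \infty} \frac{g^{-1}_{LLS}(t)}{\log(t)} = \lim_{t \to \infty} \log^*(t) = \infty.$$

Thus, $g^{-1}_{LLS}(t) \in \omega(\log(t))$, and by Lemma 3.7 and Corollary
3.6 we have an achievable expected total regret of

$$E[R_T] \in O\left( n \cdot g^{-1}_{LLS}(T) \right) \subseteq O\left( n \cdot \log(T) \cdot \log^*(T) \right).$$

□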



Remark 3.9: In accordance with other results, such as 
[6], we suspect this problem has a lower bound that is 
asymptotically $c n \cdot \log(T)$, where $c$ is dependent on the
problem parameters U and z*. If this is the case, then by 
including the term log*(T), we are able to obtain an upper 
bound which is not tight, but within a factor of log*(T), 
while avoiding a dependence on the problem parameters. 

B. Generalization to Arm-dependent Rewards

Suppose that each arm $u \in U$ has a potentially different
value of the reward, so that instead of a $\{0, 1\}$ reward, it
has a $\{0, w_u\}$ reward. Furthermore, suppose that $\{w_u\}_{u \in U}$
is known. Several definitions must be generalized, namely in
the model,

$$X_t \sim w_{C_t} \cdot \mathrm{Ber}(\alpha^*_{C_t}),$$

$$V = \left\{ v \in U : w_v \alpha^*_v = \max_{u \in U} w_u \alpha^*_u \right\} \subseteq U,$$

$$A = \left\{ z \in \mathbb{R}^n : \arg\max_{u \in U} w_u \alpha_u(z) \in V \right\},$$

$$E[R_T] = E_g\left[ \sum_{t=1}^{T} (w_V \alpha^*_V - X_t) \right],$$

and in the algorithm,

$$C(l) \leftarrow \arg\max_{u \in U} w_u \alpha_u(\hat z_l).$$

Then, Theorem 3.5 generalizes with only minor modifications
to the proof, yielding

$$E[R_T] \le 2 w_V \alpha^*_V n \left( k_2 + g^{-1}(T) + 1 \right).$$

Corollaries 3.6 and 3.8 also generalize, with the same results
as before.

IV. ANALYSIS, INFINITE ARMS 

Consider the multi-armed bandit problem described in
Section I.B, with each point on the unit sphere in $\mathbb{R}^n$ being
an arm. Now, the number of arms is uncountably infinite,
and the finite arm analysis from before no longer yields a
useful bound; since there is no longer a gap between the best
and second-best arms, the region $A$ degenerates into a line,
causing $\gamma = 0$ and $k_2 = \infty$.



A. Upper-bound Results 

For this special case of this infinite arms problem, we
shall show that a total expected regret, up to time $T$, of
$O(\sqrt{n^3 T})$ is achievable. To obtain a meaningful bound in
this case, we eliminate the dependence on $\gamma$, but the trade-off
is a worse dependence on $T$. We obtain the $O(\sqrt{n^3 T})$ by
analyzing the Two-Phase Algorithm with the choice of arms
$\Xi = \{e_1, \ldots, e_n\}$, the standard basis, and the scheduling
function $g(l) = \lfloor l/n \rfloor$.
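With the standard basis as exploration arms, the estimate and the arm choice take a particularly simple form, sketched below. The sketch assumes that $(\Xi^T)^{-1}$ is the identity for the standard basis, so each coordinate of the estimate is the logit of the corresponding empirical mean, and that the chosen arm is the normalized estimate; in a bad epoch an arbitrary arm (here the first basis vector) is returned. All function names are hypothetical.

```python
import math

def estimate_z(a_hat):
    """Coordinate-wise logit of the empirical means; zero vector in a bad epoch."""
    if not all(0.0 < a < 1.0 for a in a_hat):
        return [0.0] * len(a_hat)
    return [math.log(a / (1.0 - a)) for a in a_hat]

def best_unit_arm(z_hat):
    """Unit vector maximizing u^T z_hat; arbitrary choice if z_hat is zero."""
    norm = math.sqrt(sum(x * x for x in z_hat))
    if norm == 0.0:
        return [1.0] + [0.0] * (len(z_hat) - 1)
    return [x / norm for x in z_hat]

def schedule(l, n):
    """Scheduling function g(l) = floor(l / n)."""
    return l // n
```

For example, `best_unit_arm([3.0, 4.0])` gives `[0.6, 0.8]`, and `schedule(7, 3)` gives 2 exploitation pulls in epoch 7 when $n = 3$.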

The proof can be decomposed into several parts.

• First, we have already shown that the probability of an
epoch being bad decreases exponentially in the epoch
number, in Lemma 3.2.

• Then, for a good epoch, the probability that $\hat\alpha$ deviates
from the true value $\alpha^*$ also decreases exponentially in
the epoch number.

• Large deviations of this estimate can be related to large
errors in the central angle between the estimated value
$\hat z_l$ and the true value $z^*$.

• Large regret implies large deviations in this central
angle.

• Finally, the total expected regret can be bounded using
the fact $E[r_{2,l}] = \int_0^1 P(r_{2,l} > \delta) \, d\delta$.

Define $\Theta_l$ to be the central angle between $\hat z_l$ and $z^*$ when
$\hat z_l \ne 0_n$, and $\pi$ otherwise, i.e.

$$\Theta_l = \begin{cases} \arccos\left( \dfrac{\hat z_l^T z^*}{\|\hat z_l\| \cdot \|z^*\|} \right), & \hat z_l \ne 0_n \\ \pi, & \hat z_l = 0_n \end{cases}$$

Recall that $G_l$ denotes the event that epoch $l$ is a good
epoch.

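Lemma 4.1:

$$r_{2,l} > \delta \implies \Theta_l > \sqrt{\frac{8 \delta}{\|z^*\|}}.$$

Proof:

$$r_{2,l} > \delta$$
$$\implies f(v^T z^*) - f\left( C(l)^T z^* \right) > \delta$$
$$\implies f\left( \frac{(z^*)^T z^*}{\|z^*\|} \right) - f\left( \frac{\hat z_l^T z^*}{\|\hat z_l\|} \right) > \delta$$
$$\implies f(\|z^*\|) - f\left( \cos(\Theta_l) \|z^*\| \right) > \delta \quad (1)$$
$$\implies \|z^*\| \left( 1 - \cos(\Theta_l) \right) > \frac{\delta}{\max_x f'(x)} = 4 \delta$$
$$\implies \Theta_l > \sqrt{\frac{8 \delta}{\|z^*\|}}, \quad (2)$$

where (1) is since $\cos(\Theta_l) = \dfrac{\hat z_l^T z^*}{\|\hat z_l\| \cdot \|z^*\|}$, and (2) is since
$\cos(x) \ge 1 - \frac{x^2}{2}$, $\forall x$.

□

[Figure 3: the vectors $\hat z_l$ and $z^*$, the projection $p_l = \mathrm{proj}_{\hat z_l}(z^*)$, and the distance $\|z^* - p_l\|$.]

Fig. 3. Schematic diagram illustrating the relationships between $\|\hat z_l - z^*\|$,
$\|z^* - p_l\|$, and $\Theta_l$.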

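Lemma 4.2: With the choice $\Xi = \{e_1, \ldots, e_n\}$,

$$\Theta_l > \theta \implies \exists u \in \Xi \text{ s.t. } |u^T \hat z_l - u^T z^*| > \frac{\theta}{\pi \sqrt{n}} \cdot \|z^*\|.$$

Proof: Suppose that $\Theta_l > \theta$. There are two cases:

$$\Theta_l \in [0, \pi/2], \text{ or } \Theta_l \in (\pi/2, \pi].$$

In the first case, define $p_l$ to be the vector projection of $z^*$
onto $\hat z_l$, see Figure 3. Then,

$$\|\hat z_l - z^*\| \ge \|p_l - z^*\| = \sin(\Theta_l) \cdot \|z^*\| > \sin(\theta) \cdot \|z^*\|.$$

In the second case, when $\Theta_l > \pi/2$,

$$\|\hat z_l - z^*\| \ge \|z^*\|.$$

Note that this is true even in the special case of $\hat z_l = 0_n$, by
our definition of $\Theta_l$. Then, $\forall \theta \in [0, 2\pi]$,

$$\Theta_l > \theta$$
$$\implies \|\hat z_l - z^*\| > \frac{\theta}{\pi} \cdot \|z^*\|$$
$$\implies \exists i \in \{1, 2, \ldots, n\} \text{ s.t. } \left| (e_i)^T \hat z_l - (e_i)^T z^* \right| > \frac{\theta}{\pi \sqrt{n}} \cdot \|z^*\|$$
$$\implies \exists u \in \Xi \text{ s.t. } |u^T \hat z_l - u^T z^*| > \frac{\theta}{\pi \sqrt{n}} \cdot \|z^*\|.$$

□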



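Lemma 4.3: In a good epoch,

$$\exists u \in \Xi = \{e_1, e_2, \ldots, e_n\}, \delta_2 \in (0, 1], \text{ s.t. } |u^T \hat z_l - u^T z^*| > \delta_2 \cdot \|z^*\|$$
$$\implies |\hat\alpha_{u,l} - \alpha^*_u| > \delta_2 \cdot \|z^*\| \cdot f'(2 \|z^*\|).$$

Proof: Suppose the current epoch is good, and

$$\exists u \in \Xi = \{e_1, e_2, \ldots, e_n\}, \delta_2 \in (0, 1], \text{ s.t. } |u^T \hat z_l - u^T z^*| > \delta_2 \cdot \|z^*\|.$$

Since the current epoch is good, $f^{-1}(\hat\alpha_{u,l})$ exists $\forall u \in \Xi$.
By construction, $\Xi$ is full rank, and

$$\hat z_l = (\Xi^T)^{-1} f^{-1}(\hat\alpha_{\Xi,l}).$$

Thus, $\forall u \in \Xi$,

$$\hat\alpha_{u,l} = f(u^T \hat z_l) \text{ and } \alpha^*_u = f(u^T z^*).$$

Let $\beta = u^T z^* + \delta_2 \cdot \|z^*\| \cdot \mathrm{sgn}(u^T \hat z_l - u^T z^*)$. Then,

$$|\hat\alpha_{u,l} - \alpha^*_u| = \left| f(u^T \hat z_l) - f(u^T z^*) \right|$$
$$> \left| f(\beta) - f(u^T z^*) \right| \quad (1)$$
$$\ge \left| \beta - u^T z^* \right| \cdot \min_{x \in [\beta, u^T z^*]} f'(x) \quad (2)$$
$$\ge \delta_2 \cdot \|z^*\| \cdot \min_{x \in [-2\|z^*\|, 2\|z^*\|]} f'(x) \quad (3)$$
$$= \delta_2 \cdot \|z^*\| \cdot f'(2 \|z^*\|), \quad (4)$$

where (1) is since $f$ is strictly increasing and

$$\beta \in [u^T z^*, u^T \hat z_l],$$

(2) is since $f'$ is continuous, (3) is since $\delta_2 \le 1$ and

$$|u^T z^*| \le \|z^*\| \quad \forall u \in \Xi,$$

and thus $[\beta, u^T z^*] \subset [-2\|z^*\|, 2\|z^*\|]$, and (4) is since $f'$
is unimodal and symmetric.

□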



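Lemma 4.4:

$$P\left( \exists u \in \Xi \text{ s.t. } |\hat\alpha_{u,l} - \alpha^*_u| > \delta_3 \right) \le 2n \cdot \exp\left( -2l \cdot (\delta_3)^2 \right).$$

Proof: Note that $E[\hat\alpha_{u,l}] = \alpha^*_u$, $\forall l \in \mathbb{N}_1, u \in \Xi$. Then, for
any $u \in \Xi$, by the Hoeffding bound,

$$P\left( |\hat\alpha_{u,l} - \alpha^*_u| > \delta_3 \right) \le 2 \cdot \exp\left( -2l \cdot (\delta_3)^2 \right).$$

Applying a union bound over all $u \in \Xi$ yields the desired
result.

□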
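The Hoeffding step in the proof above can be checked against the exact binomial tail for small $l$. In this sketch, `exact_deviation_prob` sums the binomial probability mass over outcomes whose empirical mean deviates from $p$ by more than `delta`; the function names and parameter values are illustrative.

```python
import math

def exact_deviation_prob(p, l, delta):
    """P(|S / l - p| > delta) for S ~ Binomial(l, p), computed exactly."""
    total = 0.0
    for s in range(l + 1):
        if abs(s / l - p) > delta:
            total += math.comb(l, s) * p ** s * (1.0 - p) ** (l - s)
    return total

def hoeffding_bound(l, delta):
    """Two-sided Hoeffding bound 2 * exp(-2 * l * delta^2)."""
    return 2.0 * math.exp(-2.0 * l * delta ** 2)
```

For every combination of $l$, $p$ and `delta` tried, the exact tail probability sits below the Hoeffding bound, and multiplying the bound by $n$ gives the union bound over the $n$ exploration arms stated in the lemma.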



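Lemma 4.5:

$$E[r_{2,l} \mid G_l] \le \frac{\pi^2 n^2}{8 l \cdot \|z^*\| \cdot f'(2 \|z^*\|)^2}.$$

Proof: Combining Lemmas 4.1, 4.2, 4.3 and 4.4 we have
that

$$P(r_{2,l} > \delta \mid G_l)$$
$$\le P\left( \Theta_l > \sqrt{\frac{8 \delta}{\|z^*\|}} \;\middle|\; G_l \right)$$
$$\le P\left( \exists u \in \Xi : |u^T \hat z_l - u^T z^*| > \sqrt{\frac{8 \delta \|z^*\|}{\pi^2 n}} \;\middle|\; G_l \right)$$
$$\le 2n \cdot \exp\left( -l \cdot \frac{16 \delta}{\pi^2 n} \cdot \|z^*\| \cdot f'(2 \|z^*\|)^2 \right).$$

Let $k_3$ denote the constants independent of $l$ and $n$, namely

$$k_3 = \frac{16}{\pi^2} \cdot \|z^*\| \cdot f'(2 \|z^*\|)^2.$$

Then,

$$E[r_{2,l} \mid G_l] = \int_0^1 P(r_{2,l} > \delta \mid G_l) \, d\delta$$
$$\le \int_0^1 2n \cdot \exp\left( -\delta \cdot \frac{k_3 l}{n} \right) d\delta$$
$$= 2n \cdot \frac{n}{k_3 l} \left[ 1 - \exp\left( -\frac{k_3 l}{n} \right) \right]$$
$$\le \frac{2 n^2}{k_3 l}.$$

□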

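Theorem 4.6: For the Two-Phase Algorithm on a unit
sphere of arms in $\mathbb{R}^n$, we have the following bound on the
total expected regret up to time-horizon $T$:

$$E[R_T] \le \left( \alpha^*_V + \frac{2}{k_3} \right) n L + 2 \sum_{l=1}^{L} l \cdot \exp(-k_1 l).$$

Proof: Since the final epoch may be only partially finished,
we will lower-bound the total time with the number of
timesteps in all prior epochs. Decomposing into Phase 1 and
Phase 2, we have

$$T \ge \sum_{l=1}^{L-1} \left\{ n + \left\lfloor \frac{l}{n} \right\rfloor \right\} \ge \sum_{l=1}^{L-1} \frac{l}{n} \ge \frac{(L-1)^2}{2n}.$$

Equivalently,

$$L \le \sqrt{2nT} + 1.$$

By Lemmas 3.2 and 4.5 we can upper-bound the expected
total Phase 2 regret,

$$E[R_{2,T}] \le \sum_{l=1}^{L} \left[ P(G_l) \cdot E[r_{2,l} \mid G_l] + P(\neg G_l) \cdot E[r_{2,l} \mid \neg G_l] \right] \cdot \left\lfloor \frac{l}{n} \right\rfloor$$
$$\le \sum_{l=1}^{L} \left[ \frac{2 n^2}{k_3 l} + 2n \cdot \exp(-k_1 l) \right] \cdot \frac{l}{n}$$
$$= \frac{2 n L}{k_3} + 2 \sum_{l=1}^{L} l \cdot \exp(-k_1 l).$$

Then, using Lemma 3.1,

$$E[R_T] = E[R_{1,T}] + E[R_{2,T}]$$
$$\le \alpha^*_V n L + \frac{2 n L}{k_3} + 2 \sum_{l=1}^{L} l \cdot \exp(-k_1 l)$$
$$= \left( \alpha^*_V + \frac{2}{k_3} \right) n L + 2 \sum_{l=1}^{L} l \cdot \exp(-k_1 l)$$
$$\le \left( \alpha^*_V + \frac{2}{k_3} \right) \cdot \left( \sqrt{2 n^3 T} + n \right) + 2 \sum_{l=1}^{L} l \cdot \exp(-k_1 l).$$

□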



Corollary 4.7: For the Two-Phase Algorithm on a unit
sphere of arms in $\mathbb{R}^n$, we have the following asymptotic
bound on the expected total regret up to timestep $T$:

$$E[R_T] \in O\left( \sqrt{n^3 T} \right).$$

Proof: By Theorem 4.6, as a function of $T$,

$$E[R_T] \le \left( \alpha^*_V + \frac{2}{k_3} \right) \cdot \left( \sqrt{2 n^3 T} + n \right) + 2 \sum_{l=1}^{L} l \cdot \exp(-k_1 l) \in O\left( \sqrt{n^3 T} \right),$$

since $\alpha^*_V \le 1$, and $k_1, k_3 > 0$ are both constants dependent
only upon $\|z^*\|$.

□

Remark 4.8: The choice of scheduling function $g$ is not
restricted to be $\lfloor l/n \rfloor$; the dependence on $n$ can be altered to
change the trade-off between the constants in front of $\sqrt{T}$
and $\sum_{l=1}^{L} l \cdot \exp(-k_1 l)$. That is, the asymptotics can be
improved at the expense of short time-horizon performance.
Furthermore, if the time-horizon is known in advance, then
the scheduling function can be chosen to minimize the sum
of these two terms, just as in the finite arm case.

V. CONCLUSIONS 

We have proposed a class of parametrized multi-armed 
bandit problems, in which the reward distribution is Bernoulli 
and independent across arms and across time, with a param- 
eter that is a non-linear function of the scalar quality of an 



arm. The real-valued qualities are inner products between 
the unknown preference and known attribute vectors. Under 
this model, we are able to capture the fundamentally binary 
choice inherent in certain online machine learning problems. 

Our proposed algorithm achieves an asymptotic expected
total regret of $O(n \cdot g^{-1}(T))$ for any function
$g^{-1}(T) \in \omega(\log(T))$ in the finite arm case, and $O(\sqrt{n^3 T})$
in the infinite arm, unit sphere case. This is in contrast to
the $\Omega(m \log(T))$ lower-bound of Lai and Robbins, and the
$\Omega(T^{\frac{n+1}{n+2}})$ lower-bound of Kleinberg et al. In both cases,
the additional assumption of structure (linearly correlated
instead of independent, and logistic function of linear instead
of Lipschitz, respectively) can be used to out-perform
optimal algorithms which do not account for this structure.
We conjecture that the lower-bounds on our problem are
$\Omega(n \cdot \log(T))$ and $\Omega(\sqrt{T})$ for the finite and infinite arm
cases, respectively; if true, then this simple algorithm's
performance is nearly optimal.

Finally, our algorithm can be implemented very efficiently,
since the storage requirements are $O(n)$ and thus do not
scale with either the number of arms or the time-horizon.
Also, since the exploration and exploitation phases are decoupled,
the only history-dependent part of the algorithm,
the optimization to determine which arm to pull, is only
performed during a small number of timesteps (approximately
$O(\log(T))$ and $O(\sqrt{T})$ for finite and infinite arm
cases, respectively). In the infinite arm, unit sphere case, this
optimization itself is simply the normalization of the current
estimate $\hat z_l$.

The basic idea of increasing the length of epochs is similar 
to that of UCB2, but because our algorithm uses a global 
count of the epoch instead of local counts for each arm, it 
is applicable to infinite arm problems. Finally, we note that 
several extensions to this work are possible; multiple plays 
and time-dependent U and z* would be directly applicable 
for e-commerce applications. 